Converses for Distributed Estimation 
via Strong Data Processing Inequalities 

Aolin Xu and Maxim Raginsky 


Abstract —We consider the problem of distributed estimation, 
where local processors observe independent samples conditioned 
on a common random parameter of interest, map the observations
to a finite number of bits, and send these bits to a remote 
estimator over independent noisy channels. We derive converse 
results for this problem, such as lower bounds on Bayes risk. 
The main technical tools include a lower bound on the Bayes 
risk via mutual information and small ball probability, as well 
as strong data processing inequalities for the relative entropy. 
Our results can recover and improve some existing results on 
distributed estimation with noiseless channels, and also capture 
the effect of noisy channels on the estimation performance. 

Index Terms —Distributed estimation, Bayes risk, strong data 
processing inequalities. 

I. Introduction 

The problem of distributed estimation arises when the 
estimator does not have direct access to the samples generated 
according to the parameter of interest, but only to the data 
received from local processors that observe the samples. In this 
work, we consider a general model of distributed estimation, 
where each of the m processors observes n independent 
samples drawn conditionally on a common d-dimensional 
parameter, generates a b-bit quantized message, and sends it 
to a remote estimator with T uses of an independent noisy 
channel. We derive lower bounds on the Bayes risk and on 
the minimum b or T needed to achieve a certain Bayes risk. 
Fundamental limits of similar problems have been studied 
recently by Duchi et al. [1] and Shamir [2] with the assumption 
of noiseless channels (cf. also earlier work by Gallager [3] and 
by Han and Amari [4]). 

To some extent, the parameter to be estimated in the 
problem under consideration can be viewed as a message 
to be sent in a transmission system, and the samples to be 
processed and quantized can be viewed as the input data to a 
compression system. However, a few important features make 
the problem distinct from data compression and transmission. 
First, the dimension of the parameter may be fixed and 
not grow with the number of channel uses. Second, due to 
communication and computation constraints, the number of 
bits in the quantized message may not grow with the sample 
size. For example, as pointed out in [4], the samples can 
be compressed at asymptotically zero rate, which makes it 
impossible to reconstruct the samples, yet still suffices to 
reliably estimate the parameter. The number of bits in the 
quantized message may not grow with the number of channel 
uses either. Due to these features, the conventional coding 
theorems in information theory cannot be applied here, but 
we can still use information-theoretic techniques to derive 
fundamental limits for the general problem of distributed 
estimation. 

The authors are with the Department of Electrical and Computer Engineering 
and the Coordinated Science Laboratory, University of Illinois, Urbana, 
IL 61801, USA. E-mails: {aolinxu2,maxim}@illinois.edu. 

Research supported in part by the NSF under award no. CCF-1017564, 
by CAREER award no. CCF-1254041, and by the Center for Science of 
Information (CSoI), an NSF Science and Technology Center, under grant 
agreement CCF-0939370. 

One of the major tools we use is a lower bound on the Bayes 
risk in terms of mutual information and small ball probability, 
which we derive using techniques introduced in our earlier 
work [5]. Another major tool is the strong data processing 
inequality (SDPI) for relative entropy [6]-[8], which lets us 
quantify the contraction of mutual information caused by 
communication constraints. 

The general results we obtain are non-asymptotic in d, n, 
b, T and m, and can be used to derive asymptotic results. 
Examples are given for estimating both discrete and continuous 
parameters, where the converses closely match achievable 
performance. Moreover, our results can be naturally applied 
to minimax lower bounds, since the latter are always lower- 
bounded by the Bayes risk. We start with the single-processor 
setting, and then generalize the results to the multi-processor 
setting. We are able to recover and improve some existing 
results on distributed estimation with noiseless channels [1] as 
special cases, while our general results can capture the effect 
of noisy channels on the estimation performance. 

II. Main tools 

A lower bound on Bayes risk. In the standard Bayesian 
estimation framework, $\mathcal{P} = \{P_{X|W=w} : w \in \mathsf{W}\}$ is a family 
of distributions on an observation space $\mathsf{X}$, and the parameter 
space $\mathsf{W}$ is endowed with a prior $P_W$. We estimate $W$ from 
$X$ via an estimator $\psi$, so that $\hat W = \psi(X)$. Given a 
distortion function $\ell : \mathsf{W} \times \mathsf{W} \to \mathbb{R}^+$, define the Bayes risk 

$$R_B = \inf_{\psi} \mathbb{E}[\ell(W, \hat W)].$$

For a given $\psi$, the excess distortion probability $\mathbb{P}(\ell(W, \hat W) > \rho)$ 
can be lower bounded in terms of the mutual information 
$I(W; \hat W)$ and the so-called small ball probability of $W$ with 
respect to the distortion function $\ell$ [5], defined as 

$$\mathcal{L}(W, \rho) = \sup_{w \in \mathsf{W}} \mathbb{P}(\ell(W, w) \le \rho).$$

This quantity measures the "spread" of the prior distribution 
$P_W$. The lower bound on $\mathbb{P}(\ell(W, \hat W) > \rho)$ can be conveniently 
converted to a lower bound on $\mathbb{E}[\ell(W, \hat W)]$ through 
Markov's inequality. Using the techniques from our earlier 
work [5], we obtain the following lower bound on the Bayes 
risk (see Appendix A for the proof): 

Theorem 1. In the above Bayesian estimation framework, 

$$R_B \ge \sup_{\rho > 0} \rho \left(1 - \frac{I(W;X) + \log 2}{\log(1/\mathcal{L}(W,\rho))}\right).$$

Similar methods to derive Bayes risk lower bounds have 
been recently proposed by Chen et al. [9], where they obtained 
lower bounds in terms of general f-informativities [10] and 
a quantity essentially the same as the small ball probability. 
Theorem 1 reveals two sources of the intrinsic difficulty of 
estimating $W$: the amount of information about $W$ contained 
in the observation $X$, captured by $I(W;X)$, and the spread 
of the prior distribution $P_W$, captured by $\mathcal{L}(W,\cdot)$. When 
an estimator does not have direct access to $X$ but only 
through one or more local processors, the mutual information 
between $W$ and the estimator's indirect observations will 
be a contraction of $I(W;X)$. The contraction is caused by 
the communication constraints between the local processors 
and the estimator, such as storage limitations of intermediate 
results, limited transmission blocklength, channel noise, etc. 

Contraction of mutual information via SDPI. We quantify the 
contraction of mutual information using strong data processing 
inequalities for the relative entropy (see [8] and references 
therein). Given a stochastic kernel (channel) $K$ with input 
alphabet $\mathsf{X}$ and output alphabet $\mathsf{Y}$, and a reference input 
distribution $\mu$ on $\mathsf{X}$, we say that $K$ satisfies an SDPI at $\mu$ 
with constant $c \in [0, 1)$ if $D(\nu K \| \mu K) \le c\, D(\nu \| \mu)$ for 
any other input distribution $\nu$ on $\mathsf{X}$. Here, $\mu K$ denotes the 
marginal distribution of the channel output when the input 
has distribution $\mu$. The tightest such constants, 

$$\eta(\mu, K) = \sup_{\nu \neq \mu} \frac{D(\nu K \| \mu K)}{D(\nu \| \mu)}, \qquad \eta(K) = \sup_{\mu} \eta(\mu, K),$$

are also the maximum contraction ratios of mutual information 
in a Markov chain [7]: for a Markov chain $W - X - Y$, 

$$\sup_{P_{W|X}} \frac{I(W;Y)}{I(W;X)} = \eta(P_X, P_{Y|X}) \tag{1}$$

if the joint distribution $P_{X,Y}$ is fixed, and 

$$\sup_{P_{W,X}} \frac{I(W;Y)}{I(W;X)} = \eta(P_{Y|X}) \tag{2}$$

if only the channel $P_{Y|X}$ is fixed. It is generally hard to 
precisely compute the SDPI constant for an arbitrary pair of 
$\mu$ and $K$, except for some special cases. One such case is 
the binary symmetric channel, for which $\eta(\mathrm{Bern}(\tfrac12), \mathrm{BSC}(\varepsilon)) = 
\eta(\mathrm{BSC}(\varepsilon)) = (1 - 2\varepsilon)^2$ [11]. Various upper bounds on SDPI 
constants have been proposed (see [8] and references therein). 
We will need one such bound [6]: 

Lemma 1. Define the Dobrushin contraction coefficient of 
a channel $P_{X|W}$ by $\vartheta(P_{X|W}) = \sup_{w,w'} \|P_{X|W=w} - P_{X|W=w'}\|_{\mathrm{TV}}$. 
Then $\eta(P_{X|W}) \le \vartheta(P_{X|W})$. 

For product input distributions and product channels, the SDPI 
constant tensorizes [7] (see [8] for a more general result for 
other f-divergences): 

Lemma 2. For distributions $\mu_1, \ldots, \mu_n$ on $\mathsf{X}$ and channels 
$K_1, \ldots, K_n$ with input alphabet $\mathsf{X}$, 

$$\eta(\mu_1 \otimes \cdots \otimes \mu_n, K_1 \otimes \cdots \otimes K_n) = \max_{1 \le i \le n} \eta(\mu_i, K_i).$$

Finally, motivated by Evans and Schulman [12] and by Polyanskiy 
and Wu [13], the following lemma characterizes the SDPI 
constant for multiple uses of a channel (see Appendix B for 
the proof): 

Lemma 3. For a stochastic kernel $P_{V|U}$, consider the stationary 
and memoryless channel $P_{V^T|U^T} = \prod_{t=1}^T P_{V_t|U_t}$. The SDPI 
constant of $P_{V^T|U^T}$ satisfies 

$$\eta(P_{V^T|U^T}) \le 1 - (1 - \eta(P_{V|U}))^T.$$

III. Results for a single processor 

Consider the following distributed estimation problem with 
a single processor, shown schematically in Fig. 1: 

• $W = (W_1, \ldots, W_d)$ is a random parameter (discrete or 
continuous) with mutually independent coordinates. 

• The $d \times n$ array of observations $X^n$ is generated 
conditionally on $W$ as follows: for each $j \in [d]$, given 
$W_j = w_j$, the $n$ samples in the $j$th row of $X^n$, denoted 
by $X_j^n = (X_{j,1}, \ldots, X_{j,n})$, are independently generated 
according to a given stochastic kernel $P_{X_j|W_j=w_j}$. 

• The local processor observes $X^n$ and generates a $b$-bit 
message $Y = \varphi_1(X^n)$. 

• The encoder maps $Y$ to a codeword $U^T = \varphi_2(Y)$ with 
blocklength $T$, and transmits $U^T$ over the noisy channel. 
The channel is memoryless, with stochastic kernel $P_{V|U}$. 

• The remote estimator $\psi$ estimates $W$ from the received 
codeword $V^T$, so that $\hat W = \psi(V^T)$. 

Fig. 1. General model (single processor). 

The Bayes risk in this problem setup is defined as 

$$R_B = \inf_{\varphi_1, \varphi_2, \psi} \mathbb{E}[\ell(W, \psi(V^T))].$$

In order to apply Theorem 1, we need an upper bound on the 
mutual information $I(W; V^T)$ which is independent of $\varphi_1$, 
$\varphi_2$, and $\psi$. All logarithms are binary, unless stated otherwise. 

Theorem 2. For any choice of $\varphi_1$, $\varphi_2$, and $\psi$, 

$$I(W; V^T) \le \min\Big\{ (H(X^n) \wedge b)\, \eta_T \max_{1 \le j \le d} \eta(P_{X_j^n}, P_{W_j|X_j^n}),\ I(W; X^n)\, \eta_T,\ CT \Big\},$$

where $r \wedge s = \min\{r, s\}$, $\eta_T = 1 - (1 - \eta(P_{V|U}))^T$, and $C$ is 
the Shannon capacity of the channel $P_{V|U}$. 
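
As a rough numerical illustration of Theorem 2, the following sketch evaluates the three terms of the bound and returns their minimum. The function and argument names (`eta_T`, `theorem2_bound`, `eta_src`, and so on) are hypothetical and not part of the paper; all information quantities are assumed to be supplied in bits, and `eta_src` stands for $\max_j \eta(P_{X_j^n}, P_{W_j|X_j^n})$.

```python
def eta_T(eta_channel: float, T: int) -> float:
    """Lemma 3 upper bound on the SDPI constant of T uses of a memoryless channel."""
    return 1.0 - (1.0 - eta_channel) ** T


def theorem2_bound(H_X: float, b: float, eta_src: float, I_WX: float,
                   eta_channel: float, C: float, T: int) -> float:
    """Minimum of the three terms in Theorem 2 (illustrative helper).

    H_X         : entropy H(X^n) in bits
    b           : number of bits in the quantized message
    eta_src     : max_j eta(P_{X_j^n}, P_{W_j|X_j^n})
    I_WX        : mutual information I(W; X^n) in bits
    eta_channel : SDPI constant eta(P_{V|U}) of a single channel use
    C           : Shannon capacity of the channel in bits per use
    T           : blocklength
    """
    e_T = eta_T(eta_channel, T)
    quantization_term = min(H_X, b) * eta_src * e_T
    sample_term = I_WX * e_T
    capacity_term = C * T
    return min(quantization_term, sample_term, capacity_term)
```

For instance, for a BSC with crossover probability 0.1 one would pass `eta_channel = (1 - 2 * 0.1) ** 2`, so that `eta_T` approaches 1 geometrically fast as the blocklength grows.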















Proof: Consider the Markov chain $W - X^n - Y - U^T - V^T$. 
From Lemma 3, Eq. (2), and the ordinary DPI, we have 

$$I(W; V^T) \le I(W; U^T)\, \eta_T \le I(W; Y)\, \eta_T. \tag{3}$$

On the one hand, 

$$I(W; Y) \le I(X^n; Y)\, \eta(P_{X^n}, P_{W|X^n}) \tag{4}$$

as a consequence of (1). Lemma 2 and the fact that 
$(W_1, X_1^n), \ldots, (W_d, X_d^n)$ are independent imply that 

$$\eta(P_{X^n}, P_{W|X^n}) = \max_{1 \le j \le d} \eta(P_{X_j^n}, P_{W_j|X_j^n}).$$

Finally, since $Y$ takes values in $[2^b]$, 

$$I(X^n; Y) \le \min\{H(X^n), H(Y)\} \le \min\{H(X^n), b\}.$$

Using these bounds in (3) and (4), we get 

$$I(W; V^T) \le (H(X^n) \wedge b) \max_{1 \le j \le d} \eta(P_{X_j^n}, P_{W_j|X_j^n})\, \eta_T.$$

Alternatively, using $I(W; Y) \le I(W; X^n)$ in (3), we get 
$I(W; V^T) \le I(W; X^n)\, \eta_T$. Lastly, because the noisy channel 
is memoryless, we have $I(W; V^T) \le I(U^T; V^T) \le CT$. 

We complete the proof by taking the minimum of the three 
resulting estimates to get the tightest bound on $I(W; V^T)$. ■ 

Next we study a few examples of this problem setup to 
illustrate the effectiveness of using Theorem 1 and Theorem 2 
to derive converse results for the Bayes risk. 


Example 1: Transmitting a bit over a BSC. Suppose $W$ is 
$\mathrm{Bern}(\tfrac12)$, $W = X^n = Y$, $P_{V|U}$ is $\mathrm{BSC}(\varepsilon)$, so that 
$\eta(P_{V|U}) = (1 - 2\varepsilon)^2$, and $\ell(w, \hat w) = \mathbf{1}\{w \neq \hat w\}$. 

Corollary 1. The minimum blocklength $T^*$ to achieve $R_B \le p$ 
satisfies 

$$T^* \ge \frac{\log\frac{1}{h(p)}}{\log\frac{1}{4\varepsilon\bar\varepsilon}} \ge \frac{\log\frac{1}{p} - \log\log\frac{e}{p}}{\log\frac{1}{4\varepsilon\bar\varepsilon}} \sim \frac{\log\frac{1}{p}}{\log\frac{1}{4\varepsilon\bar\varepsilon}} \quad \text{as } p \to 0,$$

where $h(\cdot)$ is the binary entropy function, and $\bar\varepsilon = 1 - \varepsilon$. 

Proof: In this case, we can bypass Theorem 1 by using 
the bound $1 - h(\mathbb{P}(\hat W \neq W)) \le I(W; V^T)$. Theorem 2 gives 
$I(W; V^T) \le I(W; X^n)\, \eta_T \le 1 - (4\varepsilon\bar\varepsilon)^T$. We obtain the lower 
bound using the fact that $h(p) \le p \log\frac{e}{p}$. ■ 

The blocklength of a repetition code with error probability 
of at most $p$ gives an upper bound on $T^*$. By the Chernoff 
bound [14], a blocklength-$T$ repetition code can achieve 
$\mathbb{P}(\hat W \neq W) \le 2^{-\frac{T}{2}\log\frac{1}{4\varepsilon\bar\varepsilon}}$. Thus 

$$T^* \le 2 \log\frac{1}{p} \Big/ \log\frac{1}{4\varepsilon\bar\varepsilon}.$$

We see that the upper and lower bounds on $T^*$ only differ by 
a factor of 2 as $p \to 0$, and have the same dependence on $\varepsilon$. 

Example 2: Estimating a discrete parameter. Consider the 
case where $W$ is uniformly distributed on $\{\pm 1\}^d$ and 
$X^n \in \{\pm 1\}^{d \times n}$. Given some fixed $\delta \in [0, 1]$, 
$P_{X_{j,k}|W_j}(x_{j,k}|w_j) = (1 + x_{j,k} w_j \delta)/2$ for $j \in \{1, \ldots, d\}$ and 
$k \in \{1, \ldots, n\}$. In other words, $P_{X_{j,k}|W_j}$ is $\mathrm{BSC}(\tfrac{1-\delta}{2})$. It 
follows that $X_{j,k}$ is uniform on $\{\pm 1\}$, and $P_{W_j|X_{j,k}}$ is 
$\mathrm{BSC}(\tfrac{1-\delta}{2})$ as well. Channel $P_{V|U}$ is assumed to be arbitrary. 

Corollary 2. In Example 2, for $n = 1$, 

$$I(W; V^T) \le \min\{(d \wedge b)\, \delta^2 \eta_T,\ CT\}. \tag{5}$$

For $n > 1$, with $\beta = \frac{1-\delta}{1+\delta}$ and $\vartheta_n = \frac{1-\beta^n}{1+\beta^n}$, 

$$I(W; V^T) \le \min\Big\{ \big(d(1 + n h(\tfrac{1-\delta}{2})) \wedge b\big)\, \vartheta_n \eta_T,\ d\, \eta_T,\ CT \Big\}.$$

Proof: For $n = 1$, we have the exact SDPI constant 
$\eta(P_{X_j}, P_{W_j|X_j}) = \delta^2$ due to the fact that $X_j$ is uniform on 
$\{\pm 1\}$ and $P_{W_j|X_j}$ is $\mathrm{BSC}(\tfrac{1-\delta}{2})$. 

For $n > 1$, by Lemma 1, 

$$\eta(P_{X_j^n}, P_{W_j|X_j^n}) \le \eta(P_{W_j|X_j^n}) \le \vartheta(P_{W_j|X_j^n}), \tag{6}$$

where the Dobrushin coefficient is computed in Appendix C 
to be $\vartheta(P_{W_j|X_j^n}) = (1 - \beta^n)/(1 + \beta^n)$. We also have 
$I(W; X^n) \le d$, and 

$$H(X^n) = d H(X_1^n) \le d H(W_1, X_1^n) = d\big(H(W_1) + H(X_1^n | W_1)\big) = d\big(1 + n h(\tfrac{1-\delta}{2})\big).$$

The results then follow from Theorem 2. ■ 

Duchi et al. [1] considered the same problem with $n = 1$ 
and noiseless $P_{V|U}$. Their result (Lemma 3 in [1]), proved in 
a much more complicated way, shows that 

$$I(W; Y) \le \min\{d, b\}\, \frac{32\delta^2}{(1-\delta)^4}, \tag{7}$$

where the contraction coefficient is less than 1 only when $\delta < 
0.133$. In contrast, the contraction coefficient in (5) can never 
exceed 1, and it considerably improves the contraction 
coefficient in (7) over all $\delta \in [0, 1]$, especially for large $\delta$, 
under the same noiseless channel assumption. Combined with 
Theorem 1, Corollary 2 can be applied to derive lower bounds 
on the minimax risk in estimating the mean of an arbitrary 
probability distribution on the cube $[-1, 1]^d$. We discuss this 
application in Sec. IV, in the multi-processor setting. 
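
The comparison between the contraction coefficients in (5) and (7) can be checked numerically. The sketch below assumes the coefficient in (7) is $32\delta^2/(1-\delta)^4$, which is consistent with the stated threshold $\delta < 0.133$, and takes the noiseless-channel coefficient in (5) to be $\delta^2$. The function names are illustrative only.

```python
def sdpi_coefficient(delta: float) -> float:
    """Contraction coefficient in (5) for a noiseless channel: delta^2."""
    return delta ** 2


def duchi_coefficient(delta: float) -> float:
    """Contraction coefficient in (7), assumed to be 32 delta^2 / (1 - delta)^4."""
    return 32.0 * delta ** 2 / (1.0 - delta) ** 4


def duchi_threshold(lo: float = 0.0, hi: float = 0.5, iters: int = 60) -> float:
    """Bisection for the delta at which the coefficient in (7) reaches 1.

    The coefficient is increasing in delta on [0, 1), is 0 at delta = 0,
    and exceeds 1 at delta = 0.5, so a single root lies in between.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if duchi_coefficient(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Calling `duchi_threshold()` returns the value of $\delta$ beyond which (7) is vacuous as a contraction statement, while `sdpi_coefficient` stays at most 1 on the whole interval.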

Using Corollary 2, we can obtain lower bounds on the bit 
error probability for estimating W and on the number of bits 
to quantize the message Y. 

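Corollary 3. In Example 2, let $\ell(w, \hat w) = 
\frac{1}{d} \sum_{j=1}^d \mathbf{1}\{w_j \neq \hat w_j\}$. Then, for $n = 1$, 

$$R_B \ge h^{-1}\Big(1 - \frac{1}{d} \min\{(d \wedge b)\, \delta^2 \eta_T,\ CT\}\Big),$$

provided $b$, $d$, and $T$ are such that the argument of $h^{-1}$ lies 
in $[0, 1]$. 

Proof: Let $d_2(\cdot \| \cdot)$ be the binary divergence function; then, 
choosing $\varphi_1, \varphi_2, \psi$ that attain $R_B$, we have 

$$1 - h(R_B) = d_2(R_B \| \tfrac12) \le \frac{1}{d} \sum_{j=1}^d d_2\big(\mathbb{P}(\hat W_j \neq W_j) \,\big\|\, \tfrac12\big)$$

$$\le \frac{1}{d} \sum_{j=1}^d I(W_j; \hat W_j) \le \frac{1}{d} \big((d \wedge b)\, \delta^2 \eta_T \wedge CT\big),$$

where the first line uses the convexity of divergence, and the 
second line uses the data processing inequality for divergence, 
the fact that the $W_j$'s are i.i.d., and Corollary 2. Applying $h^{-1}$ to 
both sides, we get the result. ■ 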

Corollary 4. In Example 2, for $n = 1$, to achieve $R_B \le p$, it 
is necessary that 

$$\frac{b}{d} \ge \frac{1 - h(p)}{\delta^2 \eta_T} = \frac{1 - h(p)}{\delta^2 \big(1 - (1 - \eta(P_{V|U}))^T\big)}.$$

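In Fig. 2, this lower bound is compared with the asymptotic 
compression ratio $R(p) = 1 - h\big(\frac{p - (1-\delta)/2}{\delta}\big)$, 
$0 \le \frac{1-\delta}{2} \le p \le \frac12$, of noisy lossy coding of an i.i.d. 
$\mathrm{Bern}(\tfrac12)$ source over $\mathrm{BSC}(\tfrac{1-\delta}{2})$, and also with the 
rate-distortion function $R(p) = 1 - h(p)$ of $\mathrm{Bern}(\tfrac12)$. 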


Fig. 2. Comparison of lower bounds on b/d (p = 0.3). 


Example 3: Estimating a continuous parameter. Consider 
the problem of estimating the bias of a Bernoulli random 
variable through a BSC. In this case, $W$ is assumed to be 
uniformly distributed on $[0, 1]$, $P_{X|W=w}$ is $\mathrm{Bern}(w)$, and 
$P_{V|U}$ is $\mathrm{BSC}(\varepsilon)$. We are interested in lower-bounding the 
Bayes risk with respect to the absolute loss $\ell(w, \hat w) = |w - \hat w|$. 

Corollary 5. In Example 3, let $I^* = \sup_{\varphi_1, \varphi_2} I(W; V^T)$. 
Then the Bayes risk can be lower-bounded by 

$$R_B \ge \frac{1}{16}\, 2^{-2I^*} \tag{8}$$

for all values of $I^*$, and by 

$$R_B \gtrsim \frac{2^{-I^*}}{8 I^*} \tag{9}$$

for $I^* \to \infty$. The notation $y \gtrsim g(x)$ means that there 
exists some function $f$ such that $y \ge f(x)$ for all $x$, and 
$\lim_{x \to \infty} f(x)/g(x) = 1$. 

Proof: We have $\mathcal{L}(W, \rho) = \sup_{w \in [0,1]} \mathbb{P}(|W - w| \le \rho) \le 
\min\{2\rho, 1\}$. From Theorem 1, 

$$R_B \ge \sup_{0 < \rho < \frac12} \rho \left(1 - \frac{I^* + \log 2}{\log(1/2\rho)}\right) \ge \frac12 \sup_{0 < s < 1} s\, 2^{-\frac{I^*+1}{1-s}}, \tag{10}$$

where the last inequality is obtained by requiring 
$1 - \frac{I^* + 1}{\log(1/2\rho)} \ge s$ for $s \in (0, 1)$. Taking $s = \frac12$ in (10), 
we get (8), while optimizing over $s$ and sending $I^* \to \infty$, we 
get (9). ■ 
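
The step from (10) to (8) can be verified numerically. The sketch below assumes the reading of (10) as $\frac12 \sup_{0<s<1} s\, 2^{-(I^*+1)/(1-s)}$ and of (8) as $2^{-2I^*}/16$, with $I^*$ measured in bits; the function names are illustrative.

```python
def bound_10(I_star: float, s: float) -> float:
    """Value of the expression inside the supremum in (10) for a given s in (0, 1)."""
    return 0.5 * s * 2.0 ** (-(I_star + 1.0) / (1.0 - s))


def bound_8(I_star: float) -> float:
    """Non-asymptotic bound (8): 2^{-2 I*} / 16."""
    return 2.0 ** (-2.0 * I_star) / 16.0


def best_bound_10(I_star: float, grid: int = 1000) -> float:
    """Grid search over s in (0, 1) approximating the supremum in (10)."""
    return max(bound_10(I_star, k / grid) for k in range(1, grid))
```

Since the grid contains $s = \tfrac12$, `best_bound_10` is never smaller than `bound_8`; the gap between the two shows how much is gained by optimizing over $s$ rather than fixing it.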

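Corollary 6. In Example 3, for any choice of $\varphi_1$, $\varphi_2$, $\psi$, 

$$I(W; V^T) \le \min\Big\{ b\,(1 - 2^{-n})\big(1 - (4\varepsilon\bar\varepsilon)^T\big),\ \big(\tfrac12 \log n + \gamma_n\big)\big(1 - (4\varepsilon\bar\varepsilon)^T\big),\ (1 - h(\varepsilon))\,T \Big\},$$

where $\lim_{n \to \infty} \gamma_n = c$ for some absolute constant $c$. 

Proof: From Lemma 1, 

$$\eta(P_{W|X^n}) \le \vartheta(P_{W|X^n}) = 1 - 2^{-n}. \tag{11}$$

The proof of (11) is in Appendix D. Moreover (see, e.g., [15]) 

$$I(W; X^n) = \tfrac12 \log n + \gamma_n.$$

With these facts, the result follows from Theorem 2. ■ 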

Now we apply Corollaries 5 and 6 to two specific cases: 

Case 1: $\varepsilon = 0$, $T \ge b$. We have 

$$R_B \gtrsim \frac{2^{-(1 - 2^{-n}) b}}{8 (1 - 2^{-n})\, b} \sim \frac{1}{4 \sqrt{n} \log n}$$

for $b = \frac12 \log n$, and $n \to \infty$. To obtain an upper bound on 
$R_B$, consider the scheme where the local processor quantizes 
the sample mean $\bar X^n = n^{-1} \sum_{i=1}^n X_i$ using a uniform 
$b$-bit quantization of $[0, 1]$, and the remote estimator sets 
$\hat W$ to the quantized value. By the triangle inequality, 

$$\mathbb{E}|W - \hat W| \le \mathbb{E}|W - \bar X^n| + \mathbb{E}|\bar X^n - \hat W| \le \frac{1}{\sqrt{6n}} + 2^{-b}.$$

Thus, for $b = \frac12 \log n$, $R_B \le 1.41/\sqrt{n}$, which only differs 
from the lower bound by a logarithmic factor as $n \to \infty$. 
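
The achievability scheme of Case 1 is simple enough to simulate. The following Monte Carlo sketch assumes a midpoint reconstruction for the uniform $b$-bit quantizer (the text does not specify the reconstruction point) and uses a fixed seed for reproducibility; `simulate_case1` and `case1_upper_bound` are hypothetical names.

```python
import math
import random


def simulate_case1(n: int, b: int, trials: int = 2000, seed: int = 0) -> float:
    """Empirical E|W - W_hat| for the quantized-sample-mean scheme (noiseless channel)."""
    rng = random.Random(seed)
    levels = 2 ** b
    total = 0.0
    for _ in range(trials):
        w = rng.random()                                   # W ~ Uniform[0, 1]
        mean = sum(rng.random() < w for _ in range(n)) / n  # sample mean of n Bern(w) draws
        idx = min(int(mean * levels), levels - 1)           # index of the b-bit cell
        w_hat = (idx + 0.5) / levels                        # midpoint reconstruction (assumption)
        total += abs(w - w_hat)
    return total / trials


def case1_upper_bound(n: int, b: int) -> float:
    """Analytical upper bound 1/sqrt(6n) + 2^{-b} from the triangle inequality."""
    return 1.0 / math.sqrt(6.0 * n) + 2.0 ** (-b)
```

With `n = 256` and `b = 4` (that is, $b = \frac12 \log n$), the simulated risk should sit below the analytical bound, since both terms of the triangle inequality are loose.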
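Case 2: $\varepsilon > 0$, $b \ge \log(n + 1)$. We have 

$$R_B \gtrsim \max\left\{ \frac{c_1}{\eta_T \sqrt{n^{\eta_T}} \log n},\ \frac{2^{-(1 - h(\varepsilon)) T}}{8 (1 - h(\varepsilon))\, T} \right\}$$

$$\ge \frac{\alpha\, c_1}{\eta_T \sqrt{n^{\eta_T}} \log n} + \frac{(1 - \alpha)\, 2^{-(1 - h(\varepsilon)) T}}{8 (1 - h(\varepsilon))\, T}, \quad \forall \alpha \in [0, 1],$$

where $\eta_T = 1 - (4\varepsilon\bar\varepsilon)^T$, and $c_1$ is an absolute constant. 
Consider the scheme where the local processor computes the 
sample sum $S^n$, which is uniformly distributed on $\{0, \ldots, n\}$, 
represents it with $\log(n + 1)$ bits, and transmits these bits over 
the channel using a blocklength-$T$ code. The remote estimator 
decodes $S^n$ as $\hat S^n$, and sets $\hat W = \hat S^n / n$. Then 

$$\mathbb{E}|W - \hat W| \le \mathbb{E}|W - S^n/n| + \mathbb{E}|S^n/n - \hat S^n/n|$$

$$\le \frac{1}{\sqrt{6n}} + \mathbb{P}(\hat S^n \neq S^n) \le \frac{1}{\sqrt{6n}} + 2^{-T E_r(\frac{1}{T} \log(n+1))},$$

where $E_r(\cdot)$ is the random coding error exponent of $\mathrm{BSC}(\varepsilon)$: 

$$E_r\big(\tfrac{1}{T} \log(n + 1)\big) = 1 - \log(1 + \sqrt{4\varepsilon\bar\varepsilon}) - \tfrac{1}{T} \log(n + 1)$$

when $\tfrac{1}{T} \log(n + 1) \le 1 - h\big(\tfrac{\sqrt{\varepsilon}}{\sqrt{\varepsilon} + \sqrt{\bar\varepsilon}}\big)$ [14, p. 146]. Note that 

$$1 < \frac{1 - h(\varepsilon)}{1 - \log(1 + \sqrt{4\varepsilon\bar\varepsilon})} < 2, \quad \forall \varepsilon \in (0, \tfrac12),$$

which implies that the error exponent in the lower bound can 
closely match that in the upper bound at low transmission rate. 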
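
The claim that the ratio of the BSC capacity to $1 - \log(1 + \sqrt{4\varepsilon\bar\varepsilon})$ lies strictly between 1 and 2 can be spot-checked on a grid. This is a minimal sketch with hypothetical function names, using binary logarithms as in the rest of the paper.

```python
import math


def capacity(eps: float) -> float:
    """Capacity 1 - h(eps) of BSC(eps) in bits per channel use, for 0 < eps < 1."""
    return 1.0 + eps * math.log2(eps) + (1.0 - eps) * math.log2(1.0 - eps)


def zero_rate_exponent(eps: float) -> float:
    """The quantity 1 - log(1 + sqrt(4 eps (1 - eps))) appearing in E_r at low rates."""
    return 1.0 - math.log2(1.0 + math.sqrt(4.0 * eps * (1.0 - eps)))


def exponent_ratio(eps: float) -> float:
    """Ratio of the exponent in the lower bound to that in the upper bound."""
    return capacity(eps) / zero_rate_exponent(eps)
```

Evaluating `exponent_ratio` shows the ratio growing from near 1 for a clean channel toward 2 as the crossover probability approaches 1/2.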











































IV. A result for multiple processors 


We now consider a set-up with m local processors. Each 
processor observes an independent set of samples generated 
from a common random parameter W, and communicates with 
the remote estimator over an independent noisy channel. For 
notational simplicity, we assume that each processor applies 
the same local transformation to its samples, and that the 
channels between each processor and the remote estimator 
have the same transition probabilities. The results can be 
straightforwardly generalized to the case where the local 
encoders and the channels are different across the processors. 

Theorem 3. In the multi-processor setup described above, 

$$I(W; V^{m \times T}) \le m \min\Big\{ I(W; X^n)\, \eta_T,\ CT,\ (H(X^n) \wedge b) \max_{1 \le j \le d} \eta(P_{X_j^n}, P_{W_j|X_j^n})\, \eta_T \Big\}.$$

Proof: Due to the independence assumption, the codewords 
$V^{m \times T} = (V^T_{(1)}, \ldots, V^T_{(m)})$ received by the remote 
estimator from the processors $\{1, \ldots, m\}$ are conditionally 
independent given $W$. This implies that $I(W; V^{m \times T}) \le 
\sum_{i=1}^m I(W; V^T_{(i)})$ (see, e.g., [1, Lemma 4]). Using Theorem 2 
to upper-bound each term, we obtain the result of Theorem 3. ■ 

Using Theorem 3 with Theorem 1 and Corollary 2, we 
can obtain a lower bound on the minimax risk for estimating 
the mean of an unknown distribution $P$ on $\mathsf{X} = [-1, 1]^d$, 
where each processor $i \in \{1, \ldots, m\}$ only observes a single 
independent sample $X_{(i)} \sim P$. Let $\mathcal{P}$ denote the family of all 
probability distributions on $[-1, 1]^d$. For $P \in \mathcal{P}$, the parameter 
of $P$ in this example is formally defined as $\theta(P) = \mathbb{E}_P[X]$. 
The minimax risk is defined as 

$$R_M = \inf_{\varphi_1, \varphi_2, \psi} \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[\|\theta(P) - \psi(V^{m \times T})\|_2^2\big],$$

where $\psi$ is an estimator of $\theta \in [-1, 1]^d$. 

Corollary 7. For the above minimax estimation problem, 

Proof: At a high level, the proof strategy follows that in 
[1]. But we use Theorem 1 instead of their distance-based 
Fano inequality, and we improve their mutual information 
upper bound by Corollary 2, which also captures the influence 
of noisy channels between the processors and the remote 
estimator. Let $W$, $\delta$, and $P_{X_j|W_j}$ be defined as in Example 2 
with $n = 1$. Given $W = w$, each processor observes a sample 
$X$ with its coordinates drawn according to $P_{X_j|W_j=w_j}$. Hence 
$P_{X|W=w} \in \mathcal{P}$ for all $w \in \{\pm 1\}^d$. Let $\theta_w = \theta(P_{X|W=w}) = 
\delta w$, so that $\|\theta_w - \theta_{w'}\|_2^2 = 4\delta^2 \ell_H(w, w')$, where $\ell_H$ denotes 
the Hamming distance. Therefore, 

$$R_M \ge 4\delta^2 \inf_{\varphi_1, \varphi_2} \inf_{\psi} \mathbb{E}[\ell_H(W, \hat W)], \tag{12}$$

where the second infimum is now over all remote estimators 
$\hat W$ of $W \in \{-1, +1\}^d$. Let $\rho$ be a nonnegative integer. Then 
$\log(1/\mathcal{L}(W, \rho)) \ge d/6$ for $\rho \le d/6$ and $d \ge 12$. Thus 
from Theorem 1, 

$$\inf_{\varphi_1, \varphi_2} \inf_{\psi} \mathbb{E}[\ell_H(W, \hat W)] \ge \rho \left(1 - \frac{I(W; V^{m \times T}) + \log 2}{\log(1/\mathcal{L}(W, \rho))}\right) \ge \rho \left(1 - \frac{I(W; V^{m \times T}) + \log 2}{d/6}\right).$$

From Corollary 2 and Theorem 3, we have $I(W; V^{m \times T}) \le 
m \delta^2 \eta_T \min\{d, b\}$. Thus from (12), 

$$R_M \ge 4\delta^2 \rho \left(1 - \frac{m \delta^2 \eta_T (d \wedge b) + \log 2}{d/6}\right).$$

With $\delta^2 = \min\{1, d/(24 m \eta_T (d \wedge b))\}$, the quantity in the 
parentheses is $\ge 1/4$, and we obtain the desired result. ■ 

In the noiseless channel case, Corollary 7 recovers and 
improves the lower bound in Proposition 2 of [1], which can 
be achieved within a constant factor using a method described 
there. When the processors communicate to the estimator via 
noisy channels, the effect of the noise on the minimax risk is 
captured by $\eta_T$ in the denominator of the lower bound. 
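
The last step in the proof of Corollary 7 relies on the choice $\delta^2 = \min\{1, d/(24 m \eta_T (d \wedge b))\}$ keeping the parenthesized factor at least $1/4$ when $d \ge 12$. The sketch below checks this arithmetic, assuming binary logarithms (so $\log 2 = 1$) and a strictly positive $\eta_T$; the function names are hypothetical.

```python
def delta_squared(d: int, m: int, eta_T: float, b: float) -> float:
    """Choice of delta^2 from the proof: min{1, d / (24 m eta_T (d ^ b))}."""
    return min(1.0, d / (24.0 * m * eta_T * min(d, b)))


def parenthesized_factor(d: int, m: int, eta_T: float, b: float) -> float:
    """1 - (m delta^2 eta_T (d ^ b) + log 2) / (d / 6), with log 2 = 1 bit."""
    information = m * delta_squared(d, m, eta_T, b) * eta_T * min(d, b)
    return 1.0 - (information + 1.0) / (d / 6.0)
```

By construction the information term never exceeds $d/24$, so the factor is at least $1 - (d/24 + 1)/(d/6) = 3/4 - 6/d$, which is $1/4$ at $d = 12$ and larger beyond.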
Appendix A 
Proof of Theorem 1 

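For any estimator $\psi$, let $P$ be the joint distribution of $W$ and 
$\hat W$, and $Q$ be the product of the marginals of $P$. Define $p_\rho = 
P(\ell(W, \hat W) \le \rho)$ and $q_\rho = Q(\ell(W, \hat W) \le \rho)$ for an arbitrary 
$\rho > 0$. Then, by the data processing inequality for divergence, 

$$I(W; \hat W) = D(P \| Q) \ge d_2(p_\rho \| q_\rho) \ge p_\rho \log\frac{1}{q_\rho} - h(p_\rho) \ge p_\rho \log\frac{1}{\mathcal{L}(W, \rho)} - h(p_\rho), \tag{A.1}$$

where the last inequality follows from the fact that 

$$q_\rho = \int_{\mathsf{W}} \int_{\mathsf{W}} \mathbf{1}\{\ell(w, \hat w) \le \rho\}\, P_W(dw)\, P_{\hat W}(d\hat w) \le \sup_{\hat w \in \mathsf{W}} \mathbb{E}[\mathbf{1}\{\ell(W, \hat w) \le \rho\}] = \mathcal{L}(W, \rho).$$

Consequently, since $h(p_\rho) \le \log 2$, 

$$1 - p_\rho \ge 1 - \frac{I(W; \hat W) + \log 2}{\log(1/\mathcal{L}(W, \rho))}. \tag{A.2}$$

From the fact that $\ell(W, \hat W) \ge \rho\, \mathbf{1}\{\ell(W, \hat W) > \rho\}$, we have 
$\mathbb{E}[\ell(W, \hat W)] \ge \rho\, \mathbb{P}(\ell(W, \hat W) > \rho) = \rho (1 - p_\rho)$. 
Lower bounding $1 - p_\rho$ with (A.2), we get 

$$\mathbb{E}[\ell(W, \hat W)] \ge \sup_{\rho > 0} \rho \left(1 - \frac{I(W; \hat W) + \log 2}{\log(1/\mathcal{L}(W, \rho))}\right).$$

The proof is completed by taking the infimum over $\psi$, and 
using the fact that $I(W; \hat W) \le I(W; X)$. 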
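
The elementary inequality $d_2(p \| q) \ge p \log\frac{1}{q} - h(p)$ used in the proof follows from dropping the nonnegative term $(1-p)\log\frac{1}{1-q}$ from the expansion of the binary divergence. A minimal numerical check, with hypothetical helper names and binary logarithms, is:

```python
import math


def h(p: float) -> float:
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def d2(p: float, q: float) -> float:
    """Binary divergence d_2(p || q) in bits, for 0 <= p <= 1 and 0 < q < 1."""
    total = 0.0
    if p > 0.0:
        total += p * math.log2(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log2((1.0 - p) / (1.0 - q))
    return total


def divergence_lower_bound(p: float, q: float) -> float:
    """Right-hand side p log(1/q) - h(p) of the inequality used in Appendix A."""
    return p * math.log2(1.0 / q) - h(p)
```

The gap `d2(p, q) - divergence_lower_bound(p, q)` equals $(1-p)\log\frac{1}{1-q}$, so the bound is tight exactly when $p = 1$ or $q \to 0$.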


Appendix B 
Proof of Lemma 3 

Let $Y$ be an arbitrary random variable such that $Y \to U^T \to V^T$ 
form a Markov chain. Suppose $\eta(P_{V|U}) = p$. It suffices 
to show that 

$$I(Y; V^T) \le \big(1 - (1 - p)^T\big)\, I(Y; U^T). \tag{B.3}$$

From the chain rule, 

$$I(Y; V^T) = I(Y; V^{T-1}) + I(Y; V_T | V^{T-1}).$$

Since $Y, V^{T-1} \to U_T \to V_T$ form a Markov chain, a 
conditional version of SDPI (Corollary 1 in [12]) gives 

$$I(Y; V_T | V^{T-1}) \le p\, I(Y; U_T | V^{T-1}).$$

It follows that 

$$I(Y; V^T) \le I(Y; V^{T-1}) + p\, I(Y; U_T | V^{T-1})$$

$$= (1 - p)\, I(Y; V^{T-1}) + p\, I(Y; V^{T-1}, U_T)$$

$$\le (1 - p)\, I(Y; V^{T-1}) + p\, I(Y; U^T),$$

where the last step follows from the ordinary data processing 
inequality and the fact that $Y \to U^{T-1} \to V^{T-1}$ form a 
Markov chain. Unrolling the above recursive upper bound on 
$I(Y; V^T)$ and noting that $I(Y; V_1) \le p\, I(Y; U_1)$, we get 

$$I(Y; V^T) \le (1 - p)^{T-1} p\, I(Y; U_1) + \ldots + (1 - p)\, p\, I(Y; U^{T-1}) + p\, I(Y; U^T)$$

$$\le \big((1 - p)^{T-1} + \ldots + (1 - p) + 1\big)\, p\, I(Y; U^T)$$

$$= \big(1 - (1 - p)^T\big)\, I(Y; U^T),$$

which proves (B.3) and hence Lemma 3. 
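
The unrolling step amounts to iterating the scalar recursion $a_T = (1-p)\,a_{T-1} + p$ with $a_1 = p$, once every $I(Y; U^t)$ is upper-bounded by $I(Y; U^T)$. The sketch below, with hypothetical function names, confirms that this recursion reproduces the closed form $1 - (1-p)^T$.

```python
def unrolled_coefficient(p: float, T: int) -> float:
    """Iterate a_t = (1 - p) a_{t-1} + p starting from a_1 = p, for T >= 1."""
    a = p
    for _ in range(T - 1):
        a = (1.0 - p) * a + p
    return a


def closed_form_coefficient(p: float, T: int) -> float:
    """Closed form 1 - (1 - p)^T from Lemma 3."""
    return 1.0 - (1.0 - p) ** T
```

For example, with $p = (1 - 2\varepsilon)^2$ for a BSC, both functions give the same $\eta_T$ used throughout Section III.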

Appendix C 
Proof of (6) 

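We have $P_{W_j}(w_j) = \frac12$ for $w_j = \pm 1$, and 

$$P_{X_j^n|W_j}(x^n | w_j) = \left(\frac{1 + w_j \delta}{2}\right)^{s} \left(\frac{1 - w_j \delta}{2}\right)^{n-s},$$

where $s$ is the number of 1's in $x^n$. Thus, with $\beta = \frac{1-\delta}{1+\delta}$, 

$$P_{W_j|X_j^n}(w_j | x^n) = \begin{cases} \dfrac{1}{1 + \beta^{2s-n}}, & \text{if } w_j = 1 \\[2mm] \dfrac{1}{1 + \beta^{-2s+n}}, & \text{if } w_j = -1. \end{cases}$$

This gives 

$$\|P_{W_j|X_j^n=x^n} - P_{W_j|X_j^n=\tilde x^n}\|_{\mathrm{TV}} = \frac12 \left( \left| \frac{1}{1 + \beta^{2s-n}} - \frac{1}{1 + \beta^{2\tilde s-n}} \right| + \left| \frac{1}{1 + \beta^{-2s+n}} - \frac{1}{1 + \beta^{-2\tilde s+n}} \right| \right),$$

which is maximized by choosing $x^n$ and $\tilde x^n$ such that $s = 0$ 
and $\tilde s = n$. Hence 

$$\vartheta(P_{W_j|X_j^n}) = \frac{1}{1 + \beta^n} - \frac{1}{1 + \beta^{-n}} = \frac{1 - \beta^n}{1 + \beta^n}.$$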
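
The closed form for the Dobrushin coefficient can be cross-checked by brute force, computing the posterior of $W_j$ directly from Bayes' rule rather than from the $\beta$-parametrization. This is a sketch with hypothetical function names, valid for $0 < \delta < 1$.

```python
def posterior_plus(s: int, n: int, delta: float) -> float:
    """P(W_j = +1 | s ones among n samples), from Bayes' rule with a uniform prior."""
    like_plus = ((1 + delta) / 2) ** s * ((1 - delta) / 2) ** (n - s)
    like_minus = ((1 - delta) / 2) ** s * ((1 + delta) / 2) ** (n - s)
    return like_plus / (like_plus + like_minus)


def dobrushin_bruteforce(n: int, delta: float) -> float:
    """Max total variation between posteriors over all pairs of counts (s, s~).

    W_j is binary, so the total variation distance between two posteriors
    reduces to the absolute difference of their probabilities of W_j = +1.
    """
    posts = [posterior_plus(s, n, delta) for s in range(n + 1)]
    return max(abs(a - c) for a in posts for c in posts)


def dobrushin_closed_form(n: int, delta: float) -> float:
    """Closed form (1 - beta^n) / (1 + beta^n) with beta = (1 - delta) / (1 + delta)."""
    beta = (1 - delta) / (1 + delta)
    return (1 - beta ** n) / (1 + beta ** n)
```

For $n = 1$ the coefficient reduces to $\delta$, which is larger than the exact SDPI constant $\delta^2$ used for the $n = 1$ case of Corollary 2.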

Appendix D 
Proof of (11) 

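We have $p_W(w) = 1$ for $w \in [0, 1]$, and $P_{X^n|W}(x^n | w) = 
w^s (1 - w)^{n-s}$, where $s$ is the number of 1's in $x^n$. Thus 

$$P_{X^n}(x^n) = \int_0^1 w^s (1 - w)^{n-s}\, dw = \frac{1}{(n + 1) \binom{n}{s}}$$

and 

$$p_{W|X^n}(w | x^n) = (n + 1) \binom{n}{s} w^s (1 - w)^{n-s}.$$

This gives 

$$\|P_{W|X^n=x^n} - P_{W|X^n=\tilde x^n}\|_{\mathrm{TV}} = \frac{n + 1}{2} \int_0^1 \left| \binom{n}{s} w^s (1 - w)^{n-s} - \binom{n}{\tilde s} w^{\tilde s} (1 - w)^{n-\tilde s} \right| dw,$$

which is maximized by choosing $x^n$ and $\tilde x^n$ such that $s = 0$ 
and $\tilde s = n$. Hence 

$$\vartheta(P_{W|X^n}) = \frac{n + 1}{2} \int_0^1 \left| w^n - (1 - w)^n \right| dw = 1 - 2^{-n}.$$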
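
The value $1 - 2^{-n}$ can be confirmed by numerical integration of the posterior densities. The sketch below uses a plain midpoint rule from the standard library only; the function names and the grid size are illustrative choices, not part of the paper.

```python
import math


def posterior_density(w: float, n: int, s: int) -> float:
    """Posterior density of W given s ones among n samples (a Beta(s+1, n-s+1) density)."""
    return (n + 1) * math.comb(n, s) * w ** s * (1 - w) ** (n - s)


def total_variation(n: int, s: int, t: int, grid: int = 4000) -> float:
    """Midpoint-rule approximation of the TV distance between two posteriors."""
    step = 1.0 / grid
    acc = 0.0
    for k in range(grid):
        w = (k + 0.5) * step
        acc += abs(posterior_density(w, n, s) - posterior_density(w, n, t))
    return 0.5 * step * acc


def dobrushin_numeric(n: int) -> float:
    """Maximum TV distance over all pairs of counts (s, t) with s < t."""
    return max(total_variation(n, s, t)
               for s in range(n + 1) for t in range(s + 1, n + 1))
```

For small `n` the maximum is attained at the pair `(0, n)`, matching the choice $s = 0$, $\tilde s = n$ made in the proof.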




























